21 research outputs found

    Exploring processor parallelism: Estimation methods and optimization strategies

    No full text
    Abstract—Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or on the extension of reduced instruction-set architectures with complex custom operations. This paper focuses on very long instruction word (VLIW) architectures and, more specifically, on automating the selection of an application-specific VLIW issue-width. The issue-width selection strongly influences all the important processor properties (e.g. processing speed, silicon area, and power consumption). Therefore, accurate and efficient issue-width estimation and optimization are among the most important aspects of VLIW ASIP design. In this paper, we first compare different methods for the estimation of the required issue-width, and subsequently introduce a new force-based parallelism estimation method which is capable of estimating the required issue-width with only 3% error on average. Furthermore, we present and compare two techniques for estimating the required issue-width of software-pipelined loop kernels and show that a simple utilization-based measure provides an error margin of less than 1% on average.
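    A utilization-based issue-width measure of the kind the abstract mentions can be sketched as follows. This is an illustrative simplification, not the paper's actual formulation: the function name and the reduction to a single ceiling division are assumptions.

    ```python
    from math import ceil

    def utilization_issue_width(ops_per_iteration: int, initiation_interval: int) -> int:
        """Rough issue-width estimate for a software-pipelined loop kernel.

        With an initiation interval of II cycles between successive iterations,
        the kernel must issue ops_per_iteration operations every II cycles, so
        average utilization demands at least ceil(ops / II) parallel issue slots.
        """
        return ceil(ops_per_iteration / initiation_interval)

    # A hypothetical kernel with 14 operations per iteration and II = 4
    # needs at least ceil(14 / 4) = 4 issue slots.
    print(utilization_issue_width(14, 4))  # -> 4
    ```

    A force-based method would refine such an average by also weighing how operations can be moved within their scheduling slack, which a pure utilization measure ignores.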

    Exploration de l'espace des architectures pour des systèmes de traitement d'image, analyse faite sur des blocs fondamentaux de la rétine numérique

    No full text
    In this dissertation we present a method for running a Design Space Exploration oriented to the optimization of data transfer and storage management. The corresponding tool has been used as a front-end to HLS in order to help the user find an optimized memory micro-architecture. Our method is able to handle image processing applications with non-affine array references. It applies a paving which, on one hand, is based on a run-time dependence analysis and, on the other hand, uses disjoint, equal-by-translation blocks to partition the data and instruction sets. The non-affinity of the array references is taken into account by projecting the instruction paving onto the data paving. This method leads to a memory micro-architecture that is adapted to the non-affinity of the array references of the application and keeps the control of data transfers cheap, because the size of the transferred data blocks is invariant. In the context of high-level synthesis (HLS), which extracts a structural model from an algorithmic model, we propose solutions to optimize data access and transfer in the target hardware. A methodology for exploring the space of possible memory architectures has been developed; it finds a compromise between the amount of internal memory used and the temporal performance of the generated hardware. Two levels of optimization exist: 1) an architectural optimization, which consists of creating a memory hierarchy, and 2) an algorithmic optimization, which consists of partitioning the whole set of manipulated data so that only the data needed immediately are stored internally. For each possible partitioning, we solve the problem of scheduling the computations and mapping the data; at the end, we choose the Pareto-optimal solution(s).
We propose a tool, a front-end to the HLS, that applies the algorithmic optimization of point 2) to a user-specified image processing algorithm. The tool outputs an algorithmic model optimized for HLS, customizing a generic architecture.
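The paving and projection idea described above can be sketched in a few lines. This is a minimal illustration under assumed names: the functions, the grid shape, and the example dependence are hypothetical, not the thesis' implementation.

```python
def tile_grid(width, height, tw, th):
    """Partition a width x height data space into disjoint, equal-by-translation
    tiles of size tw x th (tiles at the border are clipped)."""
    return [(x, y, min(tw, width - x), min(th, height - y))
            for y in range(0, height, th)
            for x in range(0, width, tw)]

def input_footprint(out_tile, dependence):
    """Project one output (instruction) tile onto the input data space by
    evaluating the possibly non-affine dependence for every point of the tile,
    i.e. a run-time dependence analysis."""
    x0, y0, w, h = out_tile
    return {dependence(x, y)
            for y in range(y0, y0 + h)
            for x in range(x0, x0 + w)}

# Example with a non-affine reference: out[x][y] reads in[x*x % 8][y].
tiles = tile_grid(8, 8, 4, 4)                          # 4 disjoint 4x4 tiles
fp = input_footprint(tiles[0], lambda x, y: (x * x % 8, y))
```

Because the tiles are equal by translation, every transfer moves a block of the same size, which is what keeps the transfer control hardware cheap.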

    Rapid and accurate energy estimation of vector processing in VLIW ASIPs

    No full text
    Many modern applications in important application domains, such as communication, image and video processing, and multimedia, exhibit substantial data-level parallelism (DLP). Therefore, adequate exploitation of DLP is highly relevant. This paper focuses on effective and efficient exploitation of DLP for the synthesis of vector VLIW ASIP processors. We propose analytical energy models in order to rapidly estimate the energy consumption of a nested loop executed on a VLIW ASIP with respect to different vector widths. The models perform a rapid and relatively accurate energy consumption estimation by combining the relevant information on the application and the implementation technology. The analytical energy models are experimentally validated and the validation results are discussed.
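    The shape of such an analytical model can be illustrated with a toy sketch. The energy constants below are invented placeholders, not the paper's technology parameters; a real model would derive them from the application and the implementation technology.

    ```python
    from math import ceil

    def vector_energy(n_ops, w, e_issue=1.0, e_lane=0.5):
        """Toy energy estimate for n_ops data-parallel operations at vector width w.

        cycles = ceil(n_ops / w); every cycle pays a fixed issue/control energy
        plus per-lane datapath energy for all w lanes -- idle lanes in the final,
        partially filled cycle still cost energy, which penalizes over-wide vectors.
        """
        cycles = ceil(n_ops / w)
        return cycles * (e_issue + w * e_lane)

    # Sweep candidate vector widths and pick the cheapest for a 100-op loop body.
    widths = [1, 2, 4, 8, 16, 32]
    best = min(widths, key=lambda w: vector_energy(100, w))
    print(best, vector_energy(100, best))  # -> 16 63.0
    ```

    Even this toy model reproduces the qualitative trade-off: widening the datapath amortizes the fixed per-cycle cost, until under-utilized lanes in the last cycle start to dominate.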

    Design Space Exploration in Application-Specific Hardware Synthesis for Multiple Communicating Nested Loops

    No full text
    Application-specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires a rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and the architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application-specific hardware synthesis use loop transformations to efficiently parallelize single nested loops and use Synchronous Data Flow graphs to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data- and task-parallelism analyses, which can reduce the possibilities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for a concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. This method provides orchestrated application-specific decisions on communication architecture, memory hierarchy and computing resource parallelism. It is computationally efficient and produces high-performance architectures.
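    The Synchronous Data Flow balancing that the abstract refers to rests on the SDF balance equations. A minimal sketch for a single edge between two communicating loops (the function name and the two-actor restriction are assumptions for illustration):

    ```python
    from math import gcd

    def repetition_vector(prod, cons):
        """Smallest (qA, qB) satisfying the SDF balance equation
        prod * qA == cons * qB for an edge A -> B, where actor A produces
        `prod` tokens per firing and actor B consumes `cons` per firing."""
        g = gcd(prod, cons)
        return cons // g, prod // g

    # If loop A produces 3 tokens per firing and loop B consumes 2,
    # a balanced static schedule fires A twice and B three times per period.
    qA, qB = repetition_vector(3, 2)   # (2, 3): 3*2 == 2*3 tokens exchanged
    ```

    Solving these equations per edge gives the static firing counts that keep production and consumption balanced, so buffers between the communicating loops stay bounded.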

    Exploration de l'espace des architectures mémoire pour des systèmes de traitement d'image avec références non affines aux données (application à des blocs fondamentaux d'un modèle de rétine numérique)

    No full text
    The aim of this PhD is to propose a methodology that improves data transfer and management for applications having non-affine array references. The target applications are iterative image processing algorithms which are non-recursive and have static dependences. These applications are well described by loop-based C code and can undergo high-level synthesis, which infers an RTL model from an input C code. The input code of the HLS can be optimized, through loop transformations, with respect to its data storage and management. In fact, in the frame of the polyhedral model, loop transformations enhance data locality and enable computation parallelism and data prefetching.
These transformations require that the array references be affine. We propose a method to apply data and operation partitioning for applications with non-affine array references. An exploration is run with different tilings of the input and output data spaces. The output tiling is then projected onto the input tiling. The output-tile computations are re-scheduled in order to minimize the internal memory or to optimize the temporal performance of the produced system. A mapping between the input tiles and the internal buffers is computed and, at the end, the best solutions in the analyzed set are chosen.
